auto_annot_Haber2017_with_Smillie2019_dblabel

In [1]:
import besca as bc
import scanpy as sc
import pandas as pd
import pkg_resources

Specify folders where .h5ad files are found and their names.

The datasets that are already annotated and should be used for training. If you only use one dataset please use list of one.

In [2]:
adata_trains = [sc.read(pkg_resources.resource_filename('besca', 'datasets/data/Smillie2019_processed.h5ad'))]

The dataset of interest that should be annotated.

In [3]:
adata_pred = sc.read(pkg_resources.resource_filename('besca', 'datasets/data/haber_processed.h5ad'))
adata_orig = sc.read(pkg_resources.resource_filename('besca', 'datasets/data/haber_processed.h5ad'))

Give your analysis a name.

In [4]:
analysis_name = 'auto_annot_Haber2017_with_Smillie2019_Cluster' 

Now specify parameters

Specify column name of celltype annotation you want to train on.

In [5]:
celltype_train ='dblabel' 
celltype_test = 'dblabel'

Choose a method:

  • linear: Support Vector Machine with Linear Kernel
  • sgd: Support Vector Machine with Linear Kernel using Stochastic Gradient Descent
  • rbf: Support Vector Machine with radial basis function kernel. Very time intensive, use only on small datasets.
  • logistic_regression: Standard logistic classifier iwth multinomial loss.
  • logistic_regression_ovr: Logistic Regression with one versus rest classification.
  • logistic_regression_elastic: Logistic Regression with elastic loss, cross validates among multiple l1 ratios.
In [6]:
method = 'logistic_regression_elastic' # 'logistic_regression'

Specify merge method if using multiple training datasets. Needs to be either scanorama or naive.

In [7]:
merge = 'scanorama'

Decide if you want to use the raw format or highly variable genes. Raw increases computational time and does not necessarily improve predictions.

In [8]:
use_raw = False

You can choose to only consider a subset of genes from a signature set.

In [9]:
genes_to_use = 'all'

Prepare all training and the testing set.

In [10]:
# Select epithelial subset from Smillie2019 dataset
epithelial_subset = bc.subset_adata(adata_trains[0], adata_trains[0].obs.celltype_highlevel == 'Epi', raw=False)
adata_trains[0] = epithelial_subset
In [11]:
# Convert mouse symbols (MGI) to human symbols (HGNC)
mousehuman_file = pkg_resources.resource_filename('besca', 'datasets/homologs/MGItoHGNC.csv')
mousehuman=pd.read_csv(mousehuman_file,sep='\t',header='infer', encoding="unicode_escape")
mousehuman.index=mousehuman['MGI']
conversion=pd.Series(data=mousehuman['HGNC'], index=mousehuman.index)
In [12]:
# Convert mouse symbols (MGI) to human symbols (HGNC)
adata_orig.var.rename(columns={'SYMBOL':'MGI'}, inplace=True)
adata_orig.var['SYMBOL'] = adata_orig.var['MGI'].map(lambda x: conversion.get(x, default='') if type(conversion.get(x, default='')) == str else conversion.get(x, default=None).values[0])
adata_orig.var.index = adata_orig.var.SYMBOL
adata_orig.var_names_make_unique()
adata_pred = adata_orig.copy()

This function merges training datasets, removes unwanted genes, and if scanorama is used corrects for datasets.

In [13]:
adata_train, adata_pred = bc.tl.auto_annot.merge_data(adata_trains, adata_pred, genes_to_use = genes_to_use, merge = merge)
merging with scanorama
using scanorama rn
Found 278 genes among all datasets
[[0.         0.75798458]
 [0.         0.        ]]
Processing datasets (0, 1)
integrating training set
calculating intersection

Train the classifier.

The returned scaler is fitted on the training dataset (to zero mean and scaled to unit variance).

In [14]:
classifier, scaler = bc.tl.auto_annot.fit(adata_train, method, celltype_train)
[Parallel(n_jobs=10)]: Using backend ThreadingBackend with 10 concurrent workers.
max_iter reached after 89 seconds
max_iter reached after 89 seconds
max_iter reached after 89 seconds
max_iter reached after 92 seconds
max_iter reached after 93 seconds
max_iter reached after 93 seconds
max_iter reached after 103 seconds
max_iter reached after 104 seconds
max_iter reached after 104 seconds
max_iter reached after 110 seconds
max_iter reached after 111 seconds
max_iter reached after 112 seconds
max_iter reached after 112 seconds
max_iter reached after 112 seconds
max_iter reached after 111 seconds
max_iter reached after 111 seconds
max_iter reached after 110 seconds
max_iter reached after 111 seconds
max_iter reached after 104 seconds
max_iter reached after 105 seconds
max_iter reached after 106 seconds
max_iter reached after 104 seconds
max_iter reached after 104 seconds
max_iter reached after 104 seconds
max_iter reached after 102 seconds
max_iter reached after 102 seconds
max_iter reached after 102 seconds
max_iter reached after 101 seconds
max_iter reached after 101 seconds
max_iter reached after 101 seconds
[Parallel(n_jobs=10)]: Done   3 out of   3 | elapsed: 17.2min finished

Prediction

Use fitted model to predict celltypes in adata_pred. Prediction will be added in a new column called 'auto_annot'. Paths are needed as adata_pred will revert to its original state (all genes, no additional corrections). The threshold should be set to 0 or left out for SVM. For logisitic regression the threshold can be set.

In [15]:
adata_predicted = bc.tl.auto_annot.adata_predict(classifier = classifier, scaler = scaler, adata_pred = adata_pred, adata_orig = adata_orig, threshold = 0.7)

Write out metrics to a report file, create confusion matrices and comparative umap plots

In [16]:
adata_pred.obs
Out[16]:
CELL CONDITION sample_type donor region_x sample percent_mito n_counts n_genes batch leiden celltype0 celltype1 celltype2 celltype3 dblabel barcode region_y cell_label _merge
0 haber_intestine_donor_M1_Duo.AAACATACAGCGGA healthy mouse_small_intestine_epithelial M1 Duo Duo_M1 0.001410 12768.0 1227 Duo 22 epithelial cell paneth cell paneth cell paneth cell paneth cell AAACATACAGCGGA Jejunum Paneth both
1 haber_intestine_donor_M1_Duo.AAACATACCTTACT healthy mouse_small_intestine_epithelial M1 Duo Duo_M1 0.010779 6583.0 2156 Duo 3 epithelial cell enterocyte enterocyte enterocyte enterocyte AAACATACCTTACT Jejunum Enterocyte both
2 haber_intestine_donor_M1_Duo.AAACCGTGCAGTCA healthy mouse_small_intestine_epithelial M1 Duo Duo_M1 0.022508 2799.0 1362 Duo 1 epithelial cell epithelial fate stem cell proliferating epithelial fate stem cell proliferating epithelial fate stem cell proliferating epithelial fate stem cell AAACCGTGCAGTCA Jejunum TA both
3 haber_intestine_donor_M1_Duo.AAACGCTGCAGTCA healthy mouse_small_intestine_epithelial M1 Duo Duo_M1 0.015041 6048.0 2287 Duo 9 epithelial cell transit amplifying cell transit amplifying cell transit amplifying cell transit amplifying cell AAACGCTGCAGTCA Jejunum TA both
4 haber_intestine_donor_M1_Duo.AAACGCTGCGTGAT healthy mouse_small_intestine_epithelial M1 Duo Duo_M1 0.023022 2780.0 1320 Duo 17 epithelial cell transit amplifying cell transit amplifying cell transit amplifying cell transit amplifying cell AAACGCTGCGTGAT Jejunum Stem both
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
10891 haber_intestine_donor_M2_Il.TTTCAGTGACCAGT healthy mouse_small_intestine_epithelial M2 Il Il_M2 0.008591 5468.0 2037 Il 9 epithelial cell transit amplifying cell transit amplifying cell transit amplifying cell transit amplifying cell TTTCAGTGACCAGT Ileum EP both
10892 haber_intestine_donor_M2_Il.TTTCGAACAGAACA healthy mouse_small_intestine_epithelial M2 Il Il_M2 0.007760 10174.0 2884 Il 31 epithelial cell immature enterocyte immature enterocyte immature enterocyte immature enterocyte TTTCGAACAGAACA Ileum Enterocyte both
10893 haber_intestine_donor_M2_Il.TTTCTACTGCTCCT healthy mouse_small_intestine_epithelial M2 Il Il_M2 0.006121 9307.0 2856 Il 10 epithelial cell immature goblet cell immature goblet cell immature goblet cell immature goblet cell TTTCTACTGCTCCT Ileum Goblet both
10894 haber_intestine_donor_M2_Il.TTTGACTGCGCCTT healthy mouse_small_intestine_epithelial M2 Il Il_M2 0.007189 4029.0 1277 Il 0 epithelial cell goblet cell goblet cell goblet cell goblet cell TTTGACTGCGCCTT Ileum Goblet both
10895 haber_intestine_donor_M2_Il.TTTGCATGGAGGAC healthy mouse_small_intestine_epithelial M2 Il Il_M2 0.009316 8907.0 2277 Il 3 epithelial cell enterocyte enterocyte enterocyte enterocyte TTTGCATGGAGGAC Ileum Enterocyte both

10896 rows × 20 columns

In [17]:
%matplotlib inline

bc.tl.auto_annot.report(adata_pred=adata_predicted, celltype=celltype_test, method=method, analysis_name=analysis_name,
                        train_datasets=adata_trains, test_dataset=adata_orig, merge=merge, use_raw=False,
                        genes_to_use=genes_to_use, remove_nonshared=True, clustering='leiden', asymmetric_matrix=True)
... storing 'auto_annot' as categorical
... storing 'SYMBOL' as categorical
WARNING: saving figure to file figures/umap.ondata_auto_annot_Haber2017_with_Smillie2019_Cluster.png
WARNING: saving figure to file figures/umap.auto_annot_Haber2017_with_Smillie2019_Cluster.png
Confusion matrix, without normalization
Normalized confusion matrix
In [18]:
sc.pl.umap(adata_predicted, color=[celltype_test, 'auto_annot'])
sc.pl.umap(adata_predicted, color=[celltype_test, 'auto_annot'], legend_loc='on data', legend_fontsize=8)
In [19]:
sc.pl.umap(adata_predicted, color=[celltype_test, 'auto_annot'], legend_fontsize=7, wspace = 1.4, save = '.svg')
sc.pl.umap(adata_predicted, color=[celltype_test, 'auto_annot'], legend_loc='on data', legend_fontsize=7, wspace = 1.4, save = '.ondata.svg')
WARNING: saving figure to file figures/umap.svg
WARNING: saving figure to file figures/umap.ondata.svg
In [20]:
adata_train
Out[20]:
View of AnnData object with n_obs × n_vars = 46102 × 278
    obs: 'CELL', 'Cluster', 'Health', 'Location', 'Subject', 'celltype_highlevel', 'nGene', 'nUMI', 'original_name', 'percent_mito', 'n_counts', 'n_genes', 'batch', 'leiden', 'dblabel', 'celltype', 'cluster_celltype', 'Type'
    var: 'SYMBOL', 'ENSEMBL-0', 'n_cells-0', 'total_counts-0', 'frac_reads-0', 'ENSEMBL-1', 'n_cells-1', 'total_counts-1', 'frac_reads-1', 'MGI-1', 'highly_variable-1', 'means-1', 'dispersions-1', 'dispersions_norm-1', 'mean-1', 'std-1'
In [22]:
bc.pl.riverplot_2categories(adata_predicted, [celltype_test, 'auto_annot'])
In [23]:
adata_predicted_wo_unknown = adata_predicted.copy()
adata_predicted_wo_unknown = bc.subset_adata(adata_predicted_wo_unknown, adata_predicted_wo_unknown.obs.auto_annot != 'unknown', raw=False)
bc.pl.riverplot_2categories(adata_predicted_wo_unknown, [celltype_test, 'auto_annot'])
In [24]:
from sinfo import sinfo
sinfo()
-----
anndata             0.7.4
besca               2.2+80.g4c78e2a.dirty
pandas              1.0.5
pkg_resources       NA
plotly              4.5.4
scanpy              1.6.0
sinfo               0.3.1
sklearn             0.22
-----
IPython             7.16.1
jupyter_client      6.1.6
jupyter_core        4.6.3
notebook            6.0.3
-----
Python 3.7.6 | packaged by conda-forge | (default, Jun  1 2020, 18:46:49) [GCC 7.5.0]
Linux-3.10.0-693.11.6.el7.x86_64-x86_64-with-centos-7.4.1708-Core
24 logical CPU cores, x86_64
-----
Session information updated at 2020-12-18 15:08